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We present enhancements to the computational efficiency of exact exchange calculations using 
the density matrix and local support functions. We introduce a numerical method which avoids the 
explicit calculation the four-center two-electron repulsion integrals and reduces the prefactor scaling 
by a factor N, where N is the number of atoms within the range of the exact exchange Hamiltonian. 
This approach is based on a contraction-reduction scheme, and takes advantage of the discretization 
space which enables the direct summation over the support functions in a localized space. Using 
the sparsity property of the density matrix, the scaling of the prefactor can be further reduced to 
reach asymptotically O(N). 



PACS numbers: 71.15.Dx 71.15.Ap 71. 15. Mb 02.60.Jh 



The calculation of exchange energy as found in 
the "Fock-exchange" for Hartree-Fock (HF) theory or 
"exact-exchange" for Kohn-Sham (KS) density func- 
tional theory (DFT) is well-known to be time consum- 
ing, where for a naive implementation the scaling in- 
creases with the fourth power of the number of atoms, 
N, or the number of basis states. It has therefore 
been the focus of considerable efforts to improve both 
the efficiency and the scaling. Much of the work has 
taken place within the quantum chemistry community 
focussing on approaches using Gaussian-type orbitals 
(GTO) or exponential-type functions for the radial part 
such as the well-known Slater- type orbitals (STO). More 
recently, there has been interest in efficient implementa- 
tion within the periodic density functional theory com- 
munity, which traditionally use plane waves as basis func- 
tions (along with atomic pseudopotentials) and required 
discrete fast Fourier transform (FFT) technologies. At 
the same time, linear scaling, or 0(N), approaches to 
finding the electronic grou nd state have emerged over 
the last ten to fifteen yearsj^l and new schemes for the 
evaluation of exact exchange energy have to be developed 
including the specifications of the O(N) techniques. We 
report a novel approach to improving the efficiency of 
exchange calculations for any localised basis functions, 
which fits naturally within the formalism of linear scal- 
ing DFT calculations ; more specifically with the code 
CONQUEST^S. 



Within the framework of the HF theory, the exchange 



energy for a closed-shell system can be written as: 

using the definition of the density matrix in terms of the 
molecular eigenstates ip n (r), 



p(r,r') = 2^^„(r')C(r), 



(2) 



where n runs over the doubly occupied orbitals. This 
yields a more explicit expression for the exchange energy: 



E x = - ^ J droit 



,^(r)C(r')^(r)V m (r') 



(3) 



The typical procedure in quantum chemistry is to express 
the exchange energy of Eq. ^ in terms of 4-center 2- 
electron repulsion integrals (ERI) by expanding ip n (r) 
onto a linear combination of real atom-centered functions 
<Pi{v). Using the standard notation, the ERI formally 
reads: 



(ik\lj) = J droit 



,tp i ( r)tp k (r)tpi(r')tp j (r') 
Ir-r'l 



(4) 



The earliest approaches to linear scaling exchange used 
pre-screening on the integrals and the density matrix 
based on an as sum ed decay rate ; see for instance the 
LinKPand ONXpBH algorithms. These analytical methods 
have been successfully applied in various quantum chem- 
istry codes, though relying on specific basis sets such as 
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Gaussian- type orbitals (GTO). Alternative efficient so- 
lutions generally make the use of a 3-center reduction 
scheme,™^ deriving from the density-fitting approach of 
Baerends and Roos for Slate r-type orbtials (STO) Poland 
Dunlap and coworker o 15 ^ for GTO basis sets. Differ- 
ent app roach es with a similar spirit, such as the pse udo- 
spectraf^E^ or the resolution of identity method^ 19 * 20 ! 
have also demonstrated to be effici ent, and improvements 
are still explored by many groups jl 3 l 21 l 22 l 

The plane wave basis sets common within periodic 
DFT implementations make these approaches impossi- 
ble, and most implementations concentrate on reciprocal 
space using FFTs! 23 '' 25 llt should be mentioned here that 
acceptable accuracy is obtained only if an adequate treat- 
ment of the Coulomb singularities are considered JM®] 
Recently, using a transformation of the Kohn-Sham or- 
bitals to maximally localised Wannier functions, a linear 
scaling calculation of the exchange potential has been 
demonstrated.^ Localised numerical orbital DFT ap- 
proaches to exchange incl ude th e semi-analytic solution 
given by Toyoda and OzakP^H combining fast-spherical 
Bessel transform for the radial integration and a more 
traditional analytic method for the spherical harmonic 
part. A numerical scheme has also been proposed by 
Shang et alP^ where ERI are computed by solving nu- 
merically the Poisson's equation for each localized pair- 
density pij = ifiifj, and integrating in real space. Sim- 
ilarly to planewave periodic exchange calculations, the 
main drawback resides in the accuracy of the Poisson 
solver. All the methods outlined above involves the ex- 
plicit calculation of the full or screened set of ERIs. 

We introduce instead a route which circumvents the 
calculation of the four-center integrals and works for any 
smooth finite-range functions, which is particulary well 
suited for O(N) approaches. In standard linear scaling 
theory, the density matrix is used as the fundamental 
variable and is written in a separable form in terms of 
localised orbitals, also called support functions (f>i(r), 



with, 



p(r,r')=2^^(r)^^(r') 



(5) 



where Kij is the density matrix in the representation of 
the support functions, also known as the density kernel. 
Linear scaling is achieved when the support functions, 
centred on the atomic positions Ri, are strictly localised 
in space and a cutoff is applied to Kij so that, 



K M ; = 0for \Ri-RA >R K , 



(6) 



with Rk the density matrix range. From Eq. ^ we can 
therefore write the exchange energy as: 



E x = - ]T J dvdt 

ijkl 



l <j> i (r)K ij <j)j(r')(f) k (r)K k i<f>t(r') 



y . K ij X i. 



(7) 
(8) 



kl J 



drd? 



(9) 



The exchange matrix X becomes now the key quantity 
to calculate, and for R^ — > oo, the resulting exchange 
energy must be exact. We note that this form involves 
a contraction between K and one set of local orbitals, 
seen as the sum over I (or k) in Eq. ([9]). This type of 
contraction is frequently performed in Conquest^, but 
in this case will reduce the prefactor for exchange energy 
calculation by removing one of the four centres of the 
ERI. Moreover, if we apply a range Rx to the exchange, 
such as 



Xij = for 



Ri — Rj\ > Rx ; 



(10) 



we can then achieve linear scaling with the prefactor de- 
pending on the localisation of the matrix. The resulting 
method is not only efficient, but should be scalable in 
parallel, as it is compatible with the standard Conquest 
approach to matrix and support function operations. 

As mentioned in the introduction, the key part is 
to perform the sum over the index I before solving for 
the Coulomb potential of the pair densities; this sim- 
ple re-ordering increases the efficiency of the procedure 
markedly, as we will show below. We define new contrac- 
tion functions, $fc(r'), as: 



<Mr') 



K 



A-/C 



(r') 



(11) 



It should be outlined that the domain over which these 
functions arc defined requires some care; this is detailed 
in the Appendix. The sum over I need only include those 
support functions 4>i overlapping with <pj, as & k will be 
multiplied by this function. Contracted densities are then 
defined as: 



- pkj {r') = <S> k (r')4> j {r') : 
and the resulting Coulomb potential, 



v k j(r) 



dr' 



(12) 



(13) 



is calculated by solving Poisson's equation using, for in- 
stance, numerical FFT routines. As we discuss later, 
once the potential has been found a further contraction 
over k is performed to create the function Qj(r), as: 



(14) 



where, again, the sum over support functions k need only 
include those functions which overlap with support func- 
tion i. The matrix elements Xij are then calculated by 
integration: 



dr4>i(r)Qj(r). 



(15) 
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The set of function Qj is effectively defined by the den- 
sity matrix range -normally applied to Kki in accordance 
with Eq. ([6])- and the need for j to overlap with atoms I. 
There is therefore a clear route to efficient linear scaling 
exchange calculations within the standard approaches of 
O(N) electronic structure codes. 
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FIG. 1. Comparison of CPU times necessary to compute EXX 
in isolated water clusters as a function of number atoms (N) 
using explicit ERI calculation and the CRI method. Ideal N 4 
and TV 3 scalings are given by plain lines. 
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FIG. 2. Variation of CPU time with respect to the range Rx 
(in a.u.) for the calculation of EXX in isolated water clusters 
using the CRI method. 



As previously remarked (see Eq. (10)), the calculation 



time can be reduced further with a screening condition 
on exchange matrix elements Xij . This is related to the 
sparsity property of p(r, r')J^ and the truncation of all 
the operators involved in the HamiltonianP^ From the 
algorithm of Fig. 5] (see Appendix), we note that the eval- 
uation of the reduced potential Vkj is performed within 



a 3-index loop, which contrasts with the 4-index loop 
(over a limited set of atoms) used for the accumulation 
of temporary matrix Another possibility would be 
to compute and store the set {cf>i} (and {4>k}) once, re- 
ducing formally -after the first cycle- the execution time 
for the calculation of $k to N 3 (N 2 ) but increasing the 
data storage by N(N 2 ), respectively. 

Practical tests on the efficiency of the contraction- 
reduction integral (CRI) algorithm were performed on 
a set of isolated water clusters (H2 0)„ (n < 20) with 
fused cubes structures taken from the work of Wales and 
Hodges Calculations of exchange energy were realized 
after the KS density matrix has been converged using 
the standard self-consistent-field (SCF) method. As a 
result, the timings presented below for exact exchange 
(EXX) energy can be compared to a single SCF cycle as 
found in HF or hybrid-DFT calculation. For this demon- 
stration, single-C numerical pseudo-atomic orbitals^^H 
(NAO) have been used for hydrogen and oxygen with 
cutoff radii of 4.7 and 3.8 au, respectively. We emphasize 
that the main conclusions of this work can be easily ex- 
tended to more flexible basis sets, as far as the support 
functions are localized. SCF-KS and post-EXX calcula- 
tions were performed with a fixed grid spacing of 0.25 
au for the NAO discretization. This protocol allows us 
to realize fast enough computations on a single processor 
and also to draw qualitative conclusions on the exchange 
matrix range. 

The central processing unit (CPU) times used for the 
computation of EXX are reported in Fig. [T] as a func- 
tion of the number of atoms, for various water clusters, 
using: (i) the explicit evaluation of the full set of ERI, 
(ii) the CRI approach, and (iii) the CRI approach with 
partial storage of the NAO during the construction of 
the temporary matrix involved in the 4-index loop 
(see Appendix). Comparing the formal scalings obtained 
for the CRI methods against the full ERI approach, it 
becomes clear that the contraction-reduction algorithm 
reduces the quartic scaling to to cubic scaling with re- 
spect to the size of the water clusters. Timings can be 
further reduced by requiring the storage of the NAOs 
(the set {0;})- At this point we should emphasise that 
exchange energy values obtained with the three schemes 
are fully identical, their accuracies being only dependent 
on the Poisson solver used to evaluate the pair potential 



in Eq. (13) 



Among the various numerical FFT-based methods, one 
can choose to evaluate the Coulomb potential in recipro- 
cal or real space. Whereas the former is the most appro- 
priate for periodic neutral systems -when the positively 
charged nuclei compensate exactly the electronic charge 
density- it becomes less efficient for isolated and/or 
charged systems.^ Several schemes have been developed 
to tackle this problempSEI Alternatives based on the 
discrete variable representation (DVR) of Eq. ( 13 ) which 
avoids the direct resolution of the Poisson equation have 
been proposed.^ The density is generally expanded in 
a direct product of one-dimensional localized real-space 
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basis functioning as for instance, interpolating scaling 
functions (ISF) . After extended comp arison s between the 
DVR-ISF developed by Genovese et al! 48 | 49 | and corrected 
reciprocal FFT-based schemes ISSHSU we found that sys- 
tematic convergence of the ERI is obtained with a better 
accuracy and at a lower cost using the real space Poisson 
solver 




Number of atoms 

FIG. 3. Cumulative decomposition of the total execution time 
for EXX calculation in isolated water clusters using the CRI 
method along with the storage of the NAO and a screening 
of Rx = 7.0 au. Contributions of the three main routines 
are presented: the resolution of the Poisson equation, the 
accumulation in the temporary matrices <£> and CI, and the 
NAO discretization . 

As shown in Fig. [2j if integral screenings is introduced 
within the CRI algorithm -see Fig. [5] of the Appendix- 
the CPU time can be significantly reduced, allowing to 
reach the O(N) regime for clusters with more than 36 
atoms (at Rx — 7.0 au). Computational ressources 
further decrease for shorter EXX range along with the 
faster observation of the linear-scaling regime. Figure [3] 
presents the decomposition of the total execution time 
involved the calculation of the EXX using the CRI ap- 
proach for Rx = 7.0 au. As it would be expected, most 
of the time is spent on the accumulation in the temporary 
matrices and fl,-, and the Poisson solver. Evaluation 
of the support functions on the cubic grid do not impact 
to much on timings as far as the reduced overlap space 
technique is considered (see Appendix). 

The post-EXX accuracy with respect to Rx is given 
in Fig. [4] for the cluster (H20)2o using the "boxkite" 
structure^ which is characterized by an edge length 
around 25.5 au. Because in the present study calculated 
EXX energy is not variational with respect to Rx, we 
do not expect a monotonic behavior for the plotted con- 
vergence profiles on Fig. [4j where an accuracy below 1 
mHa is found for Rx > 6.5. This can be compared to 
the density matrix convergence of 10 -7 Ha obtained at 
Rk = 5.0 au. As a result, the price to paid for the fast 



computation of exchange energy using the CRI algorithm 
is the reduction of the accuracy, which in our case is ac- 
ceptable considering the size of the system. It should be 
mentioned that the constant exchange cutoff used in the 
3-index loop of Fig.[5]can be different at the three stages. 
This will allow acceleration of the convergence without 
significantly affecting the efficiency. This linear scaling 
approach will scale in the same way as the other proce- 
dures in Conquest, opening the way to efficient exact 
exchange calculations on 100,000+ atoms. 
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FIG. 4. Convergence of the posi-EXX energy with respect 
to the exchange range Rx for the cluster (H20)2o- Error is 
given with respect to the exact calculation. 

Finally, in this work we have shown that we are able to 
circumvent the ./V 4 scaling inferred by the standard cal- 
culation of exchange without any approximation. Even 
if the non-local nature of the EXX interaction requires a 
larger range compared to standard O(N) DFT implemen- 
tation, the linear-scaling regime is observable for a fair 
efficiency/accuracy ratio. In our case, computation time 
can be further reduced the fact that FFT-based Pois- 
son' solvers are easily parallelizable along with a judicious 
choice of the EXX matrix range. 
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Appendix A: Implementation 



radial functions and spherical harmonics are evaluated 



To describe the practical implementation of the 
contraction-reduction integral algorithm we have to start 
from the explicit definition of the exchange matrix ele- 
ments with respect to the ERIs and the basis set {<f>i\. 
Within the discretized space Eq. ([9| can be written as: 



X, 



hi hg 



a 4>i{ v h 



: (r h - B. k )<8(Th,r g ) 



Ri)h{r g - Rj)w{r h )w(r g ), (Al) 



where i?(r/,,r ff ) represents the 2-e lect ron Coulomb oper- 
ator. We made explicit in Eq. (All the fact that the 



support functions are centered on the nuclei positions 

{RJ. 

The sets {ui(r^)} and {w(r g )} account for the weight 
factors of the quadrature points {r^} and {r g }. We 
choose to work with an evenly spaced cubic grid where 
both w(rh) and w(r a ) simplify to Wi Dt = hf nt , with h m t 
the grid spacing PS Under the translation r — > r + R i; 
which leaves invariant the ERI, we obtain 



X; 



kl 



hg 



Rfei)i?(r^,r g 



R;, 



M r r%)»L (A2) 



using Ry = Ri — Rj . By virtue of the linearity of dis- 
cretized space, we are allowed to introduce the temporary 
matrix: 



$ fc (r g ;{R4) = ^T f K kl (j> l (r g - R H ). 



(A3) 



We emphasize that <fii is evaluated on a cubic grid cen- 
tered on the nucleus i. The explicit expression for the 
reduced density of Eq. (Il2| is given by, 



p kj (r g ; Rji, {Rji}) = $fc(r s )<^-(r g - R^). (A4) 

The corresponding reduced pair potential tiy is obtained 
by solving the Poisson equation. Finally, we introduce 
the temporary matrix: 



fij(r/j; Rj,-, {Rii}, {Rfei}) 



(j> k (r h -R. ki )v kj (r h ), (A5) 



k 

to perform the last numerical integration, yielding the 
exchange matrix elements: 

Xij\Th; Rji, {R;i}j {Rfci}) 

= s ^4> l (v h )£l j (r h )w iYl t- (A6) 

h 

We have made the dependence of the various matrix el- 
ements on the translation vectors {R^} explicit. The 
CRI approach involves three main operations: (i) The 
projection of <pi onto the discretized space, where both 



FIG. 5. Algorithm describing exchange kernel formally scaling 
as iV 3 . 
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loop over atom i 

l> evaluate and r> store 4>i 
loop over atom j 
if Rji < Rx then 

r> evaluate and l> store 4>j 
loop over atom k 
if Rki < Rx then 

[> evaluate 4>k and l> store ? 
loop over atom / 
> fetch K k i 
if Rn < Rx then 

[> evaluate (j>i and r> store ? 
o accumulate $fc 
end if: Rx 
end loop: / 
> calculate pkj 
[> evaluate Vkj 
o accumulate flj 
end if: Rx 
end loop: k 
end if: Rx 
end loop: j 
t> integrate Xij 
end loop: i 



on a cubic grid, (ii) The summations of Eqs. (A3) and 
(A5). (iii) The evaluation of the pair potential v k j- 

The combination of local FFT grids^with the locality 
property of the NAO easily fulfils the efficiency require- 
ment. On each primary atom i a box is centered at the 
position R.;. This box contains an ensemble of grid points 
called Bi. For the NAO set {j,k,l} in Eq. (|A2]) other 



boxes B a are defined and translated along the vector 
R a ;. Here, we choose to work with identical cubic boxes 
of length L > 2 x r™ ax , where r™ ax is the largest con- 
finement radius over the whole set of contracted support 
functions. Considering that the quadrature of Eq. (A2) 



is different from zero if significant overlap is deemed to 
exist between the orbital-pairs ik and Ij, we can first 
reduced the computational resources involved in (i) by 
defining reduced spaces as, 



O ab = B a nB b 



(A7) 



where O ab is the overlap box of cj> a with (f) b . Then the 
discretization of {<j) k ,<pi} is only realized for grid points 
common to the space span by (pi and <p k , respectively. 
Secondly, by using the fact that the coordinate system 
is centered on the primary atom, we can introduced an 
efficient screening during the course of the calculation 
and reduced the computational time related to (ii) and 
(iii). Accumulation in the temporary matrices $fe and 
flj, which are centered on atom i, is performed if the dis- 
tance R a i between the two distribution centers is below 
the EXX cutoff R x . 
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